Abstract

In machine learning contests such as the ImageNet Large Scale Visual Recognition Challenge [RDS+15]
and the KDD Cup, contestants can submit candidate solutions and receive from an oracle (typically the 
organizers of the competition) the accuracy of their guesses compared to the ground-truth labels. One of 
the most commonly used accuracy metrics for binary classification tasks is the Area Under the Receiver 
Operating Characteristics Curve (AUC). In this paper we provide proofs-of-concept of how knowledge 
of the AUC of a set of guesses can be used, in two different kinds of attacks, to improve the accuracy 
of those guesses. On the other hand, we also demonstrate the intractability of one kind of AUC exploit 
by proving that the number of possible binary labelings of n examples for which a candidate solution 
obtains an AUC score of c grows exponentially in n, for every $c \in (0,1)$.


Introduction 

Machine learning and data-mining competitions such as the ImageNet Large Scale Visual Recognition Challenge 
[RDS+15], KDD Cup [KDD15], and Facial Expression Recognition and Analysis (FERA) Challenge [VGA+15] have
helped to advance the state-of-the-art of machine learning, deep learning, and computer vision research. By establishing
common benchmarks and setting a clearer boundary between training and testing datasets - participants typically
never gain access to the testing labels directly - these competitions help researchers to estimate the performance of 
their classifiers more reliably. However, the benefit of such contests depends on the integrity of the competition and the 
generalizability of performance to real-world contexts. If contestants could somehow “hack” the competition to learn 
illegitimately the labels of the test set and increase their scores, then the value of the contest would diminish greatly. 

One potential window that data-mining contestants could exploit is the accuracy “oracle” set up by the competition 
organizers to give contestants a running estimate of their classifier’s performance. In many data-mining contests (e.g.,
those organized by Kaggle), contestants may submit, up to a fixed maximum number of times per day, a set of guesses 
for the labels in the testing set. The oracle will then reply with the accuracy of the contestant’s guesses on a (possibly 
randomized) subset of the test data. There are several ways in which these oracle answers can be exploited, including:
(1) If some contestants were able to circumvent the daily maximum number of accuracy queries they can submit to the 
oracle (e.g., by illegally registering multiple accounts on the competition’s website), then they could use those extra 
oracle responses to perform additional parameter or hyperparameter optimization over the test set and potentially gain 
a competitive edge. (2) The accuracy reported by the oracle, even though it is an aggregate performance metric over 
the entire test set (or a large subset), could convey information about the labels of individual examples in the test set. 
Contestants could use this information to refine their guesses about the test labels. Given that the difference in accuracy 
between contestants is often tiny - in KDD Cup 2015, for example, the #1 and #2 contestants’ accuracies differed by 
0.0003 AUC [KDD15] - even a small edge achieved by exploiting the AUC oracle can be perceived as worthwhile.

To date, attacks of type (1) have already been implemented and documented [Sim15]. However, to the best of
our knowledge, attacks of type (2) have not previously been investigated. In this paper, we consider how an attacker
could orchestrate attacks of type (2) on an oracle that reports the Area Under the Receiver Operating Characteristics
Curve (AUC), which is one of the most common performance metrics for binary classifiers. Specifically, we make the
following contributions:

(a) We describe an attack whereby an attacker whose classifier already achieves high AUC and who knows the 
prevalence of each class can use the oracle to infer the labels of a few test examples with complete certainty. 

(b) We provide a proof-of-concept of how the AUC score c of a set of guesses constrains the set of possible labelings 
of the entire test set, and how this information can be harnessed, using standard Bayesian inference, to improve 
the accuracy of those guesses. 

(c) We show that a brute-force attack of type (b) above is computationally tractable only for very small datasets. 
Specifically, we prove that the number of possible binary labelings of n examples for which a candidate solution 
obtains an AUC score of c grows exponentially in n, for every $c \in (0,1)$.

As the importance and prominence of data-mining competitions continue to increase, attackers will find more and 
more ingenious methods of hacking them. The greater goal of this paper is to raise awareness of this potential danger. 


Related Work 

Our work is related to the problem of data leakage, which is the inadvertent introduction of information about test 
data into the training dataset of data-mining competitions. Leakage has been named one of the most important
data-mining mistakes [NEM09] and can be surprisingly insidious and difficult to identify and prevent [KBF+00, KRPS12].
Leakage has traditionally been defined in a “static” sense, e.g., an artifact of the data preparation process that causes 
certain features to reveal the target label with complete certainty. The exploitation of an AUC oracle can be seen as 
a form of “dynamic” leakage: the oracle’s AUC response during the competition to a set of guesses submitted by a 
contestant can divulge the identity of particular test labels or at least constrain the set of possible labelings. 

Our research also relates to privacy-preserving machine learning and differential privacy (e.g., [Dwo11, CM09,
BLR13]), which are concerned with how to provide useful aggregate statistics about a dataset - e.g., the mean value of
a particular attribute, or even a hyperplane to be used for classifying the data - without disclosing private information
about particular examples within the dataset. [SCM14], for example, proposed an algorithm for computing “private
ROC” curves and associated AUC statistics. The prior work most similar to ours is by [MH13], who show how an
attacker who already knows most of the test labels can estimate the remaining labels if he/she gains access to an 
empirical ROC curve, i.e., a set of classifier thresholds and corresponding true positive and false positive rates. In a 
simulation on 100 samples, they show how a simple Markov-chain Monte Carlo algorithm can recover the remaining 
10% of missing test labels, with high accuracy, if 90% of the test labels are already known. 


ROC, AUC, and 2AFC 


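One of the most common ways to quantify the performance of a binary classifier is the Receiver Operating
Characteristics (ROC) curve. Suppose a particular test set has ground-truth labels $y_1, \ldots, y_n \in \{0, 1\}$ and the
classifier’s guesses for these labels are $\hat{y}_1, \ldots, \hat{y}_n \in \mathbb{R}$. For any fixed threshold
$\theta \in \mathbb{R}$, the false positive rate of these guesses is
$FP(\theta) = \frac{1}{n_0} \sum_{i \in \mathcal{Y}^-} I[\hat{y}_i > \theta]$ and the true positive rate is
$TP(\theta) = \frac{1}{n_1} \sum_{i \in \mathcal{Y}^+} I[\hat{y}_i > \theta]$, where $I[\cdot] \in \{0, 1\}$ is an
indicator function, $\mathcal{Y}^- = \{i : y_i = 0\}$ and $\mathcal{Y}^+ = \{i : y_i = 1\}$ are the index sets of the
negatively and positively labeled examples, and $n_0 = |\mathcal{Y}^-|$, $n_1 = |\mathcal{Y}^+|$. The ROC curve is
constructed by plotting $(FP(\theta), TP(\theta))$ for all possible $\theta$.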

The Area Under the ROC Curve (AUC) is then the integral of the ROC curve over the interval [0,1]. 

An alternative but equivalent interpretation of the AUC [TC00, AGH+05], which we use in this paper, is that it
is equal to the probability of correct response in a 2-Alternative Forced-Choice (2AFC) task, whereby the classifier’s 
real-valued outputs are used to discriminate between one positively labeled and one negatively labeled example drawn 
from the set of all such pairs in the test set. If the AUC is 1, then the classifier can discriminate between a positive 
and a negative example perfectly (i.e., with probability 1). A classifier that guesses randomly has AUC of 0.5. Using
this probabilistic definition, given the ground-truth labels $y_{1:n} = y_1, \ldots, y_n$ and the classifier’s real-valued
guesses $\hat{y}_{1:n} = \hat{y}_1, \ldots, \hat{y}_n$, we can define the function $f$ to compute the AUC of the guesses as:

$$f(y_{1:n}, \hat{y}_{1:n}) = \frac{1}{n_0 n_1} \sum_{i \in \mathcal{Y}^-} \sum_{j \in \mathcal{Y}^+} \left( I[\hat{y}_i < \hat{y}_j] + \frac{1}{2} I[\hat{y}_i = \hat{y}_j] \right)$$

In this paper, we will assume that the classifier’s guesses $\hat{y}_1, \ldots, \hat{y}_n$ are all distinct (i.e.,
$\hat{y}_i = \hat{y}_j \iff i = j$), which in many classification problems is common. In this case, the formula above
simplifies to:

$$f(y_{1:n}, \hat{y}_{1:n}) = \frac{1}{n_0 n_1} \sum_{i \in \mathcal{Y}^-} \sum_{j \in \mathcal{Y}^+} I[\hat{y}_i < \hat{y}_j] \qquad (1)$$

As is evident in Eq. 1, all that matters to the AUC is the relative ordering of the $\hat{y}_i$, not their exact values. Also, if all
examples belong to the same class (i.e., either $n_1 = 0$ or $n_0 = 0$), then the AUC is undefined. Finally, the AUC is a
rational number because it can be written as the fraction of two integers p and q. We use these facts later in the paper.
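The pairwise (2AFC) definition above can be sketched directly in code. The following is a minimal illustration, not the authors' implementation: the function name `auc` is hypothetical, ties are scored as one half as in the general formula, and `None` stands in for the undefined case where one class is empty. Exact fractions are used so that the rational nature of the AUC is preserved.

```python
from fractions import Fraction


def auc(labels, guesses):
    """Pairwise (2AFC) AUC of real-valued guesses against binary labels.

    Hypothetical helper: returns an exact Fraction, or None when one class
    is empty (the AUC is undefined in that case).
    """
    neg = [g for y, g in zip(labels, guesses) if y == 0]
    pos = [g for y, g in zip(labels, guesses) if y == 1]
    if not neg or not pos:
        return None
    score = Fraction(0)
    for g_neg in neg:
        for g_pos in pos:
            if g_neg < g_pos:
                score += 1                # correctly ordered pair
            elif g_neg == g_pos:
                score += Fraction(1, 2)   # tie counts as half
    return score / (len(neg) * len(pos))


# Three of the four (negative, positive) pairs are ordered correctly.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3/4
```

Because only comparisons between guesses are used, any strictly increasing transformation of the guesses leaves the result unchanged.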


Exploiting the AUC Score: A Simple Example 

As a first example of how knowing the AUC of a set of guesses can reveal the ground-truth test labels, consider a
hypothetical tiny test set of just 4 examples such that $y_i \in \{0, 1\}$ is the ground-truth label for example $i \in \{1, 2, 3, 4\}$.
Suppose a contestant estimates, through some machine learning process, that the probabilities that these examples
belong to the positive class are $\hat{y}_1 = 0.5$, $\hat{y}_2 = 0.6$, $\hat{y}_3 = 0.9$, and $\hat{y}_4 = 0.4$, and that the contestant submits these
guesses to the oracle. If the oracle replies that these guesses have an AUC of 0.75, then the contestant can conclude
with complete certainty that the true solution is $y_1 = 1$, $y_2 = 0$, $y_3 = 1$, and $y_4 = 0$ because this is the only
ground-truth labeling satisfying Eq. 1 (see Table 1). The contestant can then revise his/her guesses based on the information
returned by the oracle and receive a perfect score. Interestingly, in this example, the fact that the AUC is 0.75 is
even more informative than if the AUC of the initial guesses were 1, because the latter is satisfied by three different
ground-truth labelings.
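The uniqueness claim in this example can be checked by exhaustive enumeration. The sketch below is illustrative only: the helper names `auc` and `labelings_with_auc` are hypothetical, the guesses are assumed to be distinct (so ties are not handled), and the oracle is assumed to report the AUC exactly.

```python
from fractions import Fraction
from itertools import product


def auc(labels, guesses):
    """Exact AUC for distinct guesses (Eq. 1); None if a class is empty."""
    neg = [g for y, g in zip(labels, guesses) if y == 0]
    pos = [g for y, g in zip(labels, guesses) if y == 1]
    if not neg or not pos:
        return None
    correct = sum(1 for g_neg in neg for g_pos in pos if g_neg < g_pos)
    return Fraction(correct, len(neg) * len(pos))


def labelings_with_auc(guesses, target):
    """All binary labelings for which the guesses achieve exactly `target`."""
    return [labels for labels in product((0, 1), repeat=len(guesses))
            if auc(labels, guesses) == target]


guesses = [0.5, 0.6, 0.9, 0.4]
print(labelings_with_auc(guesses, Fraction(3, 4)))    # [(1, 0, 1, 0)]
print(len(labelings_with_auc(guesses, Fraction(1))))  # 3
```

The enumeration visits all $2^n$ labelings, so it is only practical for very small test sets.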

This simple example raises the question of whether knowledge of the AUC could be exploited in more general 
cases. In the next sections we explore two possible attacks that a contestant might wage. 


Attack 1: Deducing Highest/Lowest-Ranked Labels with Certainty 

In this section we describe how a contestant, whose guesses are already very accurate (AUC close to 1), can orchestrate 
an attack to infer a few of the test set labels with complete certainty. This could be useful for several reasons: (1) Even 
though the attacker’s AUC is close to 1, he/she may not know the actual test set labels - see Table 1 for an example. If
the same test examples were ever re-used in a subsequent competition, then knowing their labels could be helpful. (2) 
Once the attacker knows some of the test labels with certainty, he/she can now use these examples for training. This 
can be especially beneficial when the test set is drawn from a different population than the training set (e.g., different 
set of human subjects’ face images [VJM+11]). (3) If multiple attackers all score a high AUC but have very different
sets of guesses, then they could potentially collude to infer the labels of a large number of test examples. 

The attack requires rather strong prior knowledge of the exact number of positive and negative examples in the test 
set ($n_1$ and $n_0$, respectively).

Proposition 1. Let $D$ be a dataset with labels $y_1, \ldots, y_n$, of which $n_0$ are negative and $n_1$ are positive. Let
$\hat{y}_1, \ldots, \hat{y}_n$ be a contestant’s real-valued guesses for the labels of $D$ such that $\hat{y}_i = \hat{y}_j \iff i = j$. Let
$c = f(y_{1:n}, \hat{y}_{1:n})$ denote the AUC achieved by the real-valued guesses with respect to the ground-truth labels.
For any positive integer $k \le n_0$, if $c > 1 - \frac{1}{n_1} + \frac{k}{n_0 n_1}$, then the first $k$ examples according to the rank
order of $\hat{y}_1, \ldots, \hat{y}_n$ must be negatively labeled. Similarly, for any positive integer $k \le n_1$, if
$c > 1 - \frac{1}{n_0} + \frac{k}{n_0 n_1}$, then the last $k$ examples according to the rank order of $\hat{y}_1, \ldots, \hat{y}_n$ must be
positively labeled.

AUC for different labelings

  y1  y2  y3  y4    AUC
  0   0   0   0     -
  0   0   0   1     0.00
  0   0   1   0     1.00
  0   0   1   1     0.50
  0   1   0   0     ≈ 0.67
  0   1   0   1     0.25
  0   1   1   0     1.00
  0   1   1   1     ≈ 0.67
  1   0   0   0     ≈ 0.33
  1   0   0   1     0.00
  1   0   1   0     0.75  *
  1   0   1   1     ≈ 0.33
  1   1   0   0     0.50
  1   1   0   1     0.00
  1   1   1   0     1.00
  1   1   1   1     -

Table 1: Accuracy (AUC) achieved when a contestant’s real-valued guesses of the test labels are $\hat{y}_1 = 0.5$, $\hat{y}_2 = 0.6$,
$\hat{y}_3 = 0.9$, $\hat{y}_4 = 0.4$, shown for each possible ground-truth labeling. Only for one possible labeling (marked *)
do the contestant’s guesses achieve AUC of exactly 0.75.


Proof. Without loss of generality, suppose that the indices are arranged such that the $\hat{y}_i$ are sorted, i.e., $\hat{y}_1 < \ldots < \hat{y}_n$.
Suppose, by way of contradiction, that $m$ of the first $k$ examples were positively labeled, where $1 \le m \le k$. For each
such possible $m$, we can find at least $m(n_0 - k)$ pairs that are misclassified according to the real-valued guesses by
pairing each of the $m$ positively labeled examples within the index range $\{i : 1 \le i \le k\}$ with each of $(n_0 - k)$
negatively labeled examples within the index range $\{j : k + 1 \le j \le n\}$. In each such pair $(i, j)$, $y_i = 1$ and $y_j = 0$,
and yet $\hat{y}_i < \hat{y}_j$, and thus the pair is misclassified. The minimum number of misclassified pairs, over all $m$ in the
range $\{1, \ldots, k\}$, is clearly $n_0 - k$ (for $m = 1$). Since there are $n_0 n_1$ total pairs in $D$ consisting of one positive and
one negative example, and since the AUC is maximized when the number of misclassified pairs is minimized, then the
maximum AUC that could be achieved when $m \ge 1$ of the first $k$ examples are positively labeled is

$$1 - \frac{n_0 - k}{n_0 n_1} = 1 - \frac{1}{n_1} + \frac{k}{n_0 n_1}$$

But this is less than $c$. We must therefore conclude that $m = 0$, i.e., all of the first $k$ examples are negatively labeled.
The proof is exactly analogous for the case when $c > 1 - \frac{1}{n_0} + \frac{k}{n_0 n_1}$. □

Example 

Suppose that a contestant’s real-valued guesses $\hat{y}_1, \ldots, \hat{y}_n$ achieve an AUC of $c = 0.985$ on a dataset containing
$n_0 = 45$ negative and $n_1 = 55$ positive examples, and that $n_0$ and $n_1$ are known to him/her. Then the contestant can
conclude that the labels of the first (according to the rank order of $\hat{y}_1, \ldots, \hat{y}_n$) 7 examples must be negative and the
labels of the last 17 examples must be positive.
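The numbers in this example follow from solving the two inequalities of Proposition 1 for the largest admissible $k$. A minimal sketch, assuming the oracle reports the AUC exactly and that $n_0$ and $n_1$ are known; the function names `largest_k` and `certain_counts` are hypothetical.

```python
import math
from fractions import Fraction


def largest_k(c, n_same, n_other):
    """Largest k with c > 1 - 1/n_other + k/(n_same * n_other), clipped to [0, n_same].

    Multiplying the inequality by n_same * n_other gives
    k < (c - 1) * n_same * n_other + n_same.
    """
    bound = (c - 1) * n_same * n_other + n_same
    k = math.ceil(bound) - 1  # largest integer strictly below the bound
    return max(0, min(k, n_same))


def certain_counts(c, n0, n1):
    """(number of lowest-ranked examples known to be negative,
        number of highest-ranked examples known to be positive)."""
    c = Fraction(c)
    return largest_k(c, n0, n1), largest_k(c, n1, n0)


# Passing the AUC as a string avoids binary floating-point rounding of 0.985.
print(certain_counts("0.985", 45, 55))  # (7, 17)
```

Exact rational arithmetic matters here because the inequalities are strict; a rounded AUC could push $k$ across an integer boundary.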


Attack 2: Integration over Satisfying Solutions 

The second attack that we describe treats the AUC reported by the oracle as an observed random variable in a standard
supervised learning framework. In contrast to Attack 1, no prior knowledge of $n_0$ or $n_1$ is required. Note that the
attack we describe uses only a single oracle query to improve an existing set of real-valued guesses. More sophisticated
attacks might conceivably refine the contestant’s guesses in an iterative fashion using multiple queries. Notation: In
this section only, capital letters denote random variables and lower-case letters represent instantiated values.

Figure 1: Without node $C$, this graphical model shows a standard supervised learning problem: after estimating $\theta$
(on training data, not shown), the test labels $Y_1, \ldots, Y_n$ can be estimated from feature vectors $X_1, \ldots, X_n$, and then
submitted to the organizers of the competition. Node $C$ represents the contestant’s accuracy (AUC), which is often
provided by an oracle and can be leveraged to improve the guesses for the test set labels. Only the shaded variables
are observed.

Consider the graphical model of Fig. 1, which depicts a test set containing $n$ examples where each example $i$ is
described by a vector $x_i \in \mathbb{R}^m$ of $m$ features (e.g., image pixels). Under the model, each label $Y_i \in \{0, 1\}$ is generated
according to $P(Y_i = 1 \mid x_i, \theta) = g(x_i, \theta)$, where $g : \mathbb{R}^m \times \mathbb{R}^m \to [0, 1]$ could be, for example, the sigmoid function
of logistic regression and $\theta \in \mathbb{R}^m$ is the classifier parameter. Note that this is a standard probabilistic discriminative
model - the only difference is that we have created an intermediate variable $\hat{Y}_i \in [0, 1]$ to represent $g(x_i, \theta)$ explicitly
for each $i$. Specifically, we define:

$$P(\hat{y}_i \mid x_i, \theta) = \delta(\hat{y}_i - g(x_i, \theta))$$

$$P(Y_i = 1 \mid \hat{y}_i) = \hat{y}_i$$

where $\delta$ is the Dirac delta function.

The classification parameter $\theta$ can be estimated from training data (not shown), and thus we consider it to be
observed. Using $x_{1:n}$ and an estimate for $\theta$, the contestant can then compute $\hat{y}_{1:n}$ and submit these as his/her guesses
to the competition organizers. The question is: if the competition allows access to an oracle that reports $C$ (i.e., variable
$C$ is observed), how can this additional information be used? Since the AUC $C$ is a deterministic function (Eq. 1) of
$y_{1:n}$ and $\hat{y}_{1:n}$, we have:

$$P(c \mid y_{1:n}, \hat{y}_{1:n}) = \delta(f(y_{1:n}, \hat{y}_{1:n}) - c)$$

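Then, according to standard Bayesian inference, the contestant can use $C$ to update his/her posterior estimate of each
label $Y_i$ as follows:

$$
\begin{aligned}
P(y_i \mid \hat{y}_{1:n}, c) &= \sum_{y_{\neg i}} P(y_{1:n} \mid \hat{y}_{1:n}, c) \\
&= \sum_{y_{\neg i}} \frac{P(c \mid y_{1:n}, \hat{y}_{1:n}) \, P(y_{1:n} \mid \hat{y}_{1:n})}{P(c \mid \hat{y}_{1:n})} \\
&\propto \sum_{y_{\neg i}} P(c \mid y_{1:n}, \hat{y}_{1:n}) \prod_j P(y_j \mid \hat{y}_j) \\
&= \sum_{y_{\neg i}} \delta(f(y_{1:n}, \hat{y}_{1:n}) - c) \prod_j P(y_j \mid \hat{y}_j) \\
&= \sum_{y_{\neg i} \,:\, f(y_{1:n}, \hat{y}_{1:n}) = c} \; \prod_j P(y_j \mid \hat{y}_j) \qquad (2)
\end{aligned}
$$

where $y_{\neg i}$ refers to $y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n$.

In other words, to compute the (unnormalized) posterior probability that $Y_i = y_i$ given $C = c$, simply find all label
assignments to the other variables $Y_1, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_n$ such that the AUC is $c$, and then compute the sum of
the likelihoods $\prod_j P(y_j \mid \hat{y}_j)$ over all such assignments.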
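The posterior update of Eq. 2 can be sketched as a brute-force enumeration. This is an illustrative sketch under stated assumptions rather than the authors' code: the names `auc` and `posterior_positive` are hypothetical, the guesses are assumed distinct and strictly inside (0, 1), and the reported AUC `c` is assumed to be exact (passed as a `Fraction`).

```python
from fractions import Fraction
from itertools import product


def auc(labels, guesses):
    """Exact AUC for distinct guesses (Eq. 1); None if a class is empty."""
    neg = [g for y, g in zip(labels, guesses) if y == 0]
    pos = [g for y, g in zip(labels, guesses) if y == 1]
    if not neg or not pos:
        return None
    correct = sum(1 for g_neg in neg for g_pos in pos if g_neg < g_pos)
    return Fraction(correct, len(neg) * len(pos))


def posterior_positive(guesses, c):
    """P(Y_i = 1 | guesses, c) for every i, by summing over all 2^n labelings."""
    n = len(guesses)
    weight_positive = [0.0] * n
    total = 0.0
    for labels in product((0, 1), repeat=n):
        if auc(labels, guesses) != c:
            continue  # labeling is inconsistent with the reported AUC
        # Likelihood prod_j P(y_j | yhat_j), using P(Y_j = 1 | yhat_j) = yhat_j.
        weight = 1.0
        for y, g in zip(labels, guesses):
            weight *= g if y == 1 else 1.0 - g
        total += weight
        for i, y in enumerate(labels):
            if y == 1:
                weight_positive[i] += weight
    if total == 0.0:
        raise ValueError("no labeling is consistent with the reported AUC")
    return [w / total for w in weight_positive]


# With the four-example guesses from the earlier example, an AUC of 3/4
# leaves a single consistent labeling, so the posterior is degenerate.
print(posterior_positive([0.5, 0.6, 0.9, 0.4], Fraction(3, 4)))  # [1.0, 0.0, 1.0, 0.0]
```

The loop makes the cost explicit: every one of the $2^n$ labelings is scored, which is why this approach is limited to tiny test sets.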


Simulation 

As a proof-of-concept of the algorithm described above, we conducted a simulation on a tiny dataset of $n = 16$
examples. In particular, we let $g(x, \theta) = (1 + \exp(-\theta^\top x))^{-1}$ (sigmoid function for logistic regression), and we
sampled $\theta$ and $x_1, \ldots, x_n$ from an $m$-dimensional Normal distribution with zero mean and diagonal unit covariance.
In our simulations, the contestant does not know the value of $\theta$ but can estimate it from a training dataset containing
$k$ examples (with $L_2$ regularization of 1). In each simulation, the contestant computes $\hat{y}_{1:n}$ from its estimate of $\theta$ and
the feature vectors $x_{1:n}$. The contestant then submits $\hat{y}_{1:n}$ as guesses to the oracle, receives the AUC score $C$, and
then computes $P(Y_1 = 1 \mid \hat{y}_{1:n}, c), \ldots, P(Y_n = 1 \mid \hat{y}_{1:n}, c)$. The contestant then submits these posterior probabilities
as its second set of guesses and receives an updated AUC score $C'$. After each simulation run, we record the accuracy
gain $C' - C$. By varying $k \in \{1, \ldots, 20\}$, $m \in \{4, 5, \ldots, 16\}$, and averaging over 50 simulation runs per $(m, k)$
combination, we can then compute the expected accuracy gain $\Delta$AUC (i.e., $C' - C$) as a function of the initial AUC
($C$).

Results: The graph in Fig. 2 indicates that, for a wide range of starting AUC values $C$, a small but worthwhile
average increase in accuracy can be achieved, particularly when $C$ is closer to 0.5.

Tractability 

In the simulation above, we used a brute-force approach when solving Eq. 2: we created a list of all $2^n$ possible binary
tuples $(y_1, \ldots, y_n)$ and then simply selected those tuples that satisfied $f(y_{1:n}, \hat{y}_{1:n}) = c$. However, if the number of
such tuples were small, and if one could efficiently enumerate over them, then the attack would become much more
practical. This raises an important question: for a dataset of size $n$ and a fixed set of real-valued guesses $\hat{y}_1, \ldots, \hat{y}_n$, are
there certain AUC values for which the number of possible binary labelings is sub-exponential in $n$? We investigate
this question in the next section.

Figure 2: Tiny simulation of how exploiting knowledge of the AUC of a set of guesses can improve accuracy, as a
function of existing accuracy.

Figure 3: Construction of a binary labeling for which the AUC is $c$, for any $c = p/q$ such that $0.5 \le c < 1$.


The Number of Satisfying Solutions Grows Exponentially in n for Every AUC $c \in (0,1)$

In this section we prove that the number of tuples $(y_1, \ldots, y_n)$ for which a contestant’s guesses achieve a fixed AUC
$c$ grows exponentially in $n$, for every $c \in (0,1)$. (Note that this is different from proving that, for a dataset of some
fixed size $n$, the number of tuples $(y_1, \ldots, y_n)$ for which a contestant’s guesses achieve some AUC $c$ is exponential in
$n$.) Our proof is by construction: for any AUC $c = p/q$ with $p, q \in \mathbb{Z}^+$ and $p < q$, we show how to construct a dataset
of size $n = 4q$ such that the number of satisfying binary labelings is at least $(2 - 2|c - 0.5|)^{n/4}$. Intuitively, this lower bound
grows more quickly for AUC values close to 0.5 than for AUC values close to 0 or 1.
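For very small $n$, the number of satisfying labelings can simply be counted. The sketch below is an illustration, not part of the proof: the function name `count_labelings` is hypothetical, and the examples are assumed to be already sorted by their (distinct) guesses, so that only rank order matters.

```python
from fractions import Fraction
from itertools import product


def count_labelings(n, c):
    """Number of binary labelings of n rank-ordered examples whose AUC equals c."""
    count = 0
    for labels in product((0, 1), repeat=n):
        n1 = sum(labels)
        n0 = n - n1
        if n0 == 0 or n1 == 0:
            continue  # AUC undefined when one class is empty
        # A pair is correct when the negative is ranked below the positive,
        # so each positive contributes the number of negatives seen before it.
        correct = 0
        negatives_seen = 0
        for y in labels:
            if y == 0:
                negatives_seen += 1
            else:
                correct += negatives_seen
        if Fraction(correct, n0 * n1) == c:
            count += 1
    return count


for n in (4, 8, 12):
    print(n, count_labelings(n, Fraction(1, 2)))
```

Running the loop for increasing $n$ gives a rough empirical feel for how quickly the set of consistent labelings grows at a fixed AUC.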

First, we prove a simple lemma: 

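Lemma 1. Let $a, b, c \in \mathbb{Z}$ such that $a \ge b > 0$ and $c \ge 0$. Then $\frac{a + c}{b + c} \le \frac{a}{b}$.

Proof. By way of contradiction, suppose $\frac{a + c}{b + c} > \frac{a}{b}$. Then

$$
\begin{aligned}
(a + c) b &> a (b + c) \\
ab + bc &> ab + ac \\
bc &> ac
\end{aligned}
$$

which implies $b > a$, contradicting $a \ge b$. □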

Proposition 2. Let $D$ be a dataset consisting of $n = 4q$ examples for some $q \in \mathbb{Z}^+$, and let $\hat{y}_1, \ldots, \hat{y}_n$ be a contestant’s
real-valued guesses for the binary labels of $D$ such that $\hat{y}_i = \hat{y}_j \iff i = j$. Then for any AUC $c$ such that
$c = p/q$, $p, q \in \mathbb{Z}^+$ and $p < q$, the number of distinct binary labelings $y_1, \ldots, y_n$ for which $f(y_{1:n}, \hat{y}_{1:n}) = c$ is at least

$$(2 - 2|c - 0.5|)^{n/4}$$

Proof. Without loss of generality, suppose that the indices are arranged such that the $\hat{y}_i$ are sorted, i.e., $\hat{y}_1 < \ldots < \hat{y}_n$.
Since the AUC is invariant under monotonic transformations of the real-valued guesses, we can represent each $\hat{y}_i$
simply by its index $i$. Since $c$ is a fraction of pairs that are correctly classified, we can write it as $p/q$ for positive
integers $p, q$. We will handle the cases $c \ge 0.5$ and $c < 0.5$ separately.

Case 1 ($0.5 \le c < 1$): Construct a dataset of size $n = 4q$ as shown in Fig. 3: the first $2(2p - q)$ entries
are negative examples, and of the remaining $2(3q - 2p)$ entries (which we call the “right band”), $2q$ are positive
and $4(q - p)$ are negative. Moreover, the right band is arranged symmetrically in the following sense: for each
$i \in \{2(2p - q) + 1, \ldots, 4q\}$, $y_i = 1 \iff y_j = 1$ where $j = 2(2p - q) + (4q - i + 1)$.

Given this construction, we must calculate how many pairs containing one positive and one negative example are
correctly and incorrectly classified. Note that each of the first $2(2p - q)$ negative examples in the left band can be
paired with each of the $2q$ positive examples in the right band, and that each of these positive examples has a higher
$\hat{y}$ value than the negative examples; hence, each of these $2q \cdot 2(2p - q) = 8pq - 4q^2$ pairs is classified correctly.
The only remaining pairs of examples occur within the right band. To calculate the number of correctly/incorrectly
classified pairs in the right band, we exploit the fact that it is symmetric: For any index pair $(i, j)$ where
$i, j \in \{2(2p - q) + 1, \ldots, 4q\}$, where $y_i = 0$ and $y_j = 1$, and where $i < j$ (and hence $\hat{y}_i < \hat{y}_j$), we can find exactly one
other pair $(i', j')$ for which $y_{i'} = 0$ and $y_{j'} = 1$ and for which $i' > j'$. Hence, within the right band, the numbers of
correctly and incorrectly classified pairs are equal. Since there are $2q \cdot 4(q - p) = 8q^2 - 8pq$ pairs within this band
total, then $4q^2 - 4pq$ are classified correctly and $4q^2 - 4pq$ are classified incorrectly.

Summing the pairs of examples within the right band and the pairs between the left and right bands, we have
$8pq - 4q^2 + 8q^2 - 8pq = 4q^2$ pairs total. The number of correctly classified pairs is $4q^2 - 4pq + 8pq - 4q^2 = 4pq$,
and thus the AUC is $4pq / (4q^2) = p/q = c$, as desired.
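The Case 1 construction can be checked numerically for particular values of $p$ and $q$. The sketch below builds one specific symmetric right band (all positives of the half-band placed first, then mirrored); this particular arrangement and the names `construct_labeling` and `auc_sorted` are assumptions made for illustration, since the proof allows any symmetric arrangement.

```python
from fractions import Fraction


def construct_labeling(p, q):
    """One labeling of n = 4q rank-ordered examples with AUC p/q, for 1/2 <= p/q < 1."""
    if not (0 < p < q and 2 * p >= q):
        raise ValueError("requires 0 < p < q and p/q >= 1/2")
    left_band = [0] * (2 * (2 * p - q))      # all-negative left band
    half = [1] * q + [0] * (2 * (q - p))     # half of the right band
    return left_band + half + half[::-1]     # mirror to make the band symmetric


def auc_sorted(labels):
    """Exact AUC when examples are listed in increasing order of their guesses."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    correct = 0
    negatives_seen = 0
    for y in labels:
        if y == 0:
            negatives_seen += 1
        else:
            correct += negatives_seen
    return Fraction(correct, n0 * n1)


labels = construct_labeling(3, 4)
print(len(labels), auc_sorted(labels))  # 16 3/4
```

For $p = 3$, $q = 4$ this yields 8 negatives and 8 positives, i.e. $4q^2 = 64$ pairs, of which $4pq = 48$ are ordered correctly, matching the counts in the proof.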

Now that we have constructed a single labeling of size $n = 4q$ for which the AUC is $c$, we can construct many
more simply by varying the positions of the $2q$ positive entries within the right band of $2(3q - 2p)$ entries. To preserve
symmetry, we can vary the positions of half of the positive examples within half of the right band and then simply
“reflect” the positions onto the other half. In total, the number of choices is:

$$
\binom{3q - 2p}{q} = \frac{(3q - 2p)!}{q! \, (3q - 2p - q)!} = \frac{(3q - 2p)!}{q! \, (2q - 2p)!}
= \frac{(3q - 2p)(3q - 2p - 1) \cdots (2q - 2p + 2)(2q - 2p + 1)}{q (q - 1) \cdots (2)(1)}
$$

We now apply Lemma 1 and the fact that the numerator and denominator both contain $q$ terms:

$$\binom{3q - 2p}{q} \ge \left( \frac{3q - 2p}{q} \right)^{q} = (3 - 2c)^{n/4}$$


Case 2 ($0 < c < 0.5$): The proof is analogous except that we “flip” the AUC around 0.5 and correspondingly “flip”
the labels left-to-right. Let $r = q - p$ (so that $r/q = 1 - p/q$). Then, we form a similar construction as above, except
that the left sequence of all negative examples is moved to the right. Specifically, we create a symmetric sequence of
length $2(3q - 2r)$ such that $2q$ examples are positive and $4(q - r)$ are negative. We then append $2(2r - q)$ entries to
the right that are all negative. This results in

$$\frac{1}{2} \left( 2q \times 4(q - r) \right) = 4q^2 - 4qr$$

pairs that are classified correctly and

$$2(2r - q) \times 2q + \frac{1}{2} \left( 2q \times 4(q - r) \right) = 8qr - 4q^2 + 4q^2 - 4qr = 4qr$$

pairs that are classified incorrectly. In total, there are $4qr + 4q^2 - 4qr = 4q^2$ pairs, so that the AUC is
$(4q^2 - 4qr) / 4q^2 = 1 - r/q = p/q$, as desired.

Analogously to above, we can form

$$
\binom{3q - 2r}{q} = \frac{(3q - 2r)!}{q! \, (2q - 2r)!} \ge \left( \frac{3q - 2r}{q} \right)^{n/4}
= \left( \frac{3q - 2(q - p)}{q} \right)^{n/4} = \left( \frac{q + 2p}{q} \right)^{n/4} = (1 + 2c)^{n/4}
$$

symmetric constructions of the left band.

Combining Case 1 and Case 2, we find that the number of binary labelings for which the AUC is $c$ is at least

$$(2 - 2|c - 0.5|)^{n/4}$$

for all $0 < c < 1$. □
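As a sanity check on the final inequality, the binomial count of symmetric constructions can be compared with the stated lower bound for small $p$ and $q$. This is a minimal sketch with hypothetical function names (`symmetric_count`, `lower_bound`); it checks particular values only and is no substitute for the proof.

```python
import math
from fractions import Fraction


def symmetric_count(p, q):
    """Number of symmetric band arrangements used in the proof for c = p/q."""
    if 2 * p >= q:                      # Case 1: c >= 1/2
        return math.comb(3 * q - 2 * p, q)
    r = q - p                           # Case 2: flip the AUC around 1/2
    return math.comb(3 * q - 2 * r, q)


def lower_bound(p, q):
    """(2 - 2|c - 1/2|)^(n/4) with n = 4q, computed exactly."""
    c = Fraction(p, q)
    return (2 - 2 * abs(c - Fraction(1, 2))) ** q


for p, q in [(1, 2), (3, 4), (1, 4), (9, 10)]:
    print(p, q, symmetric_count(p, q), float(lower_bound(p, q)))
```

Both quantities are computed exactly (integers and fractions), so the comparison is not affected by floating-point error; the conversion to `float` is only for display.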


Conclusion 

In this paper we have examined properties of the Area Under the ROC Curve (AUC) that can enable a contestant of 
a data-mining contest to exploit an oracle that reports AUC scores to illegitimately attain higher performance. We 
presented two simple attacks: one whereby a contestant whose guesses already achieve high accuracy can infer, with 
complete certainty, the values of a few of the test set labels; and another whereby a contestant can harness the oracle’s 
AUC information to improve his/her guesses using standard Bayesian inference. To our knowledge, our paper is the 
first to formally investigate these kinds of attacks. 

The practical implications of our work are mixed: On the one hand, we have provided proofs-of-concept that 
systematic exploitation of AUC oracles is possible, which underlines the importance of taking simple safeguards such 
as (a) adding noise to the output of the oracle, (b) limiting the number of times that a contestant may query the oracle, 
and (c) not re-using test examples across competitions. On the other hand, we also provided evidence - in the form of 
a proof that the number of binary labelings for which a set of guesses attains an AUC of c grows exponentially in the 
test set size n - that brute-force probabilistic inference to improve one’s guesses is intractable except for tiny datasets. 
It is conceivable that some approximate inference algorithms might overcome this obstacle. 

As data-mining contests continue to grow in number and importance, it is likely that more creative exploitation - 
e.g., attacks that harness multiple oracle queries instead of just single queries - will be attempted. It is even possible 
that the focus of effort in such contests might shift from developing effective machine learning classifiers to querying 
the oracle strategically, without training a useful classifier at all. With our paper we hope to highlight an important 
potential problem in the machine learning community.